Data Science is a field of science in which we try to extract useful information from data. To use data efficiently and correctly, we must first understand the data. Guided by the goal of the study and the relevant domain knowledge, we then design the study. In this lecture we first go through some basic exploratory data analysis (EDA) to understand the nature of the data and some plausible relationships among the variables.
Data mining tools have expanded dramatically in the past 20 years. The linear model, as a building block for data science, is simple and powerful. We introduce/review the simple linear model. The focus is to understand what it is we are modeling, how to use the data to estimate the model, and how to quantify the intrinsic variability that statistics have.
Contents:
Case Study: MLPayData_Total.csv
EDA
Appendices
Technical Note:
We hide all the outputs of R chunks by setting results = "hide" globally using knitr::opts_chunk$set() in the first chunk. You can set results = "markup" (the default) or results = "hold" in the global setting to show outputs of every chunk, or in a chunk header to show outputs for that chunk. Setting results = "hold" holds all the output pieces and pushes them to the end of the chunk.
Baseball is one of the most popular sports in the US. The highest level of baseball is Major League Baseball, which includes the American League and the National League with a total of 30 teams. The New York Yankees, Boston Red Sox, Philadelphia Phillies, and the recently rising Oakland Athletics are among the top teams. The Oakland A’s are a low-budget team, but they have been moving up mainly due to General Manager (GM) Billy Beane, who is well known for applying statistics in his coaching.
Questions of interest for us:
Read an article: Billion dollar Billy Beane.
Data: baseball.csv consists of the winning records and payroll of all 30 MLB teams from 1998 to 2014 (17 years). There are 162 games in each season. We will create the following two aggregated variables:
payroll: total pay from 1998 to 2014 in billion dollars
win: average winning percentage for the span of 1998 to 2014
team: keep the team names.
To answer Q1:
How does payroll relate to the performance measured by win?
(Oakland A’s: payroll=.84, win=.54 and Red Sox: payroll=1.97, win=.55)
Data preparation:
Take a quick look at the data and create the aggregated variables payroll and win.
baseball <- read.csv("baseball.csv")
datapay <- baseball %>% # create a small dataset with team, payroll and win
group_by(team) %>%
summarise(
payroll = sum(payroll)/1000,
win = mean(win_pct))
names(datapay)
We would normally do a thorough EDA. We skip that portion of the data analysis and get to the regression problem directly.
Scatter plots show the relationship between \(x\) variable payroll and \(y\) variable win.
plot(x = datapay$payroll,
y = datapay$win,
pch = 16, # "point character": shape/character of points
cex = 0.8, # size
col = "blue", # color
xlab = "Payroll", # x-axis
ylab = "Win", # y-axis
main = "MLB Teams's Overall Win vs. Payroll") # title
text(datapay$payroll, datapay$win, labels=datapay$team, cex=0.7, pos=1) # label all pointsWe notice the positive association: when payroll increases, so does win.
ggplot
ggplot(datapay) +
geom_point(aes(x = payroll, y = win), color = "blue") +
geom_text_repel(aes(x = payroll, y = win, label=team), size=3) +
labs(title = "MLB Teams's Overall Win vs. Payroll", x = "Payroll", y = "Win")Often we would like to explore the relationship between two variables. Will a team perform better when they are paid more? The simplest model is a linear model. Let the response \(y_i\) be the win and the explanatory variable \(x_i\) be payroll (\(i = 1, \dots, n=30\)).
Assume there is a linear relationship between win and payroll, i.e.
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]
Model interpretation:
Given the payroll, on average the win is a linear function of the payroll.
The observed win is this average plus an error term.
Estimation
Once we have produced an estimate \((\hat\beta_0, \hat\beta_1)\), we can then
estimate the mean win and
predict the win for a team based on its payroll:
\[\hat y_i = \hat\beta_0 + \hat\beta_1 x_i.\]
How to estimate the parameters using the data we have? For example, how would you decide on the following three estimates?
Given an estimate \((b_0, b_1)\), we first define residuals as the differences between the actual and predicted values of the response \(y\), i.e.
\[\hat{\epsilon}_i = y_i - b_0 - b_1 x_i.\] In the previous plots, the residuals are the vertical distances between the observations and the fitted line. Now we are ready to define the OLS estimate.
The OLS estimates \(\hat\beta_0\) and \(\hat\beta_1\) are obtained by minimizing the residual sum of squares (RSS):
\[(\hat\beta_0, \hat\beta_1) = \arg\min_{b_0,\,b_1} \sum_{i=1}^{n}\hat{\epsilon}_i^2 = \arg\min_{b_0,\,b_1} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^{2}.\]
We can derive the solution of \(\hat\beta_0\) and \(\hat\beta_1\).
\[\begin{split} \hat\beta_0 &= \bar{y} - \hat\beta_1 \bar{x} \\ \hat\beta_1 &= r_{xy} \cdot \frac{s_y}{s_x} \end{split}\]
where \(\bar{x}\) and \(\bar{y}\) are the sample means, \(s_x\) and \(s_y\) are the sample standard deviations, and \(r_{xy}\) is the sample correlation between \(x\) and \(y\) (see the Appendix for the definitions).
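As a sanity check, the closed-form solution can be verified numerically against lm(). Here is a minimal sketch on simulated data (the variable names and simulated numbers are ours for illustration, not the MLB data):

```r
# Verify the OLS closed-form solution against lm() on simulated data
set.seed(42)
x <- runif(30)                            # hypothetical predictor
y <- 0.4 + 0.06 * x + rnorm(30, sd = 0.03) # hypothetical response

b1_hat <- cor(x, y) * sd(y) / sd(x)       # beta1-hat = r_xy * s_y / s_x
b0_hat <- mean(y) - b1_hat * mean(x)      # beta0-hat = ybar - beta1-hat * xbar

fit <- lm(y ~ x)
c(b0_hat, b1_hat)                         # agrees with coef(fit)
```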
The following shiny application shows how the fitted line and RSS change as we change \((b_0, b_1)\). (The following chunk is not required; it is for demonstration. RUN THE CHUNK IN THE CONSOLE to let the R session serve as a server: copy the chunk and paste it into the Console. If you are interested, this is a step-by-step tutorial to get you started.)
lm()
The function lm() will be used extensively. This function solves the minimization problem that we defined above. Below we use win as the dependent \(y\) variable and payroll as our \(x\).
As we can see from the output below, this function returns a list of many statistics. We will define these statistics later.
We can also view a summary of the lm() output by using the summary() command.
##
## Call:
## lm(formula = win ~ payroll, data = datapay)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.03932 -0.01356 -0.00134 0.00748 0.06660
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4175 0.0148 28.23 < 2e-16 ***
## payroll 0.0621 0.0106 5.88 2.5e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0256 on 28 degrees of freedom
## Multiple R-squared: 0.553, Adjusted R-squared: 0.537
## F-statistic: 34.6 on 1 and 28 DF, p-value: 2.52e-06
##
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
Notice that the outputs of myfit0 and summary(myfit0) are different.
##
## Call:
## lm(formula = win ~ payroll, data = datapay)
##
## Coefficients:
## (Intercept) payroll
## 0.4175 0.0621
To summarize, with the OLS estimates \(\hat{\beta}_0 =\) 0.417 and \(\hat{\beta}_1 =\) 0.062, we have the following fitted equation:
\[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i = 0.417 + 0.062 \cdot x_i\]
Here are what we can say:
Interpretation of the slope:
When payroll increases by 1 unit (1 billion dollars), we expect, on average, the win to increase by about 0.062.
When payroll increases by .5 unit (500 million dollars), we expect, on average, the win to increase by about 0.031.
Prediction equation:
When the payroll is 0.841 (Oakland Athletics), we estimate the average win to be
\[\hat{y}_{\text{Oakland Athletics}} = 0.417 + 0.062 \times 0.841 \approx .470.\]
The Oakland Athletics' observed win is .545, so the residual for the team is
\[\hat{\epsilon}_{\text{Oakland}}= .545 - .470 = .075.\]
Here are a few rows that show the fitted values from our model
data.frame(datapay$team, datapay$payroll, datapay$win, myfit0$fitted,
myfit0$res)[15:25, ] # show a few rows
Scatter plot with the LS line added
Base R
plot(datapay$payroll, datapay$win,
pch = 16,
xlab = "Payroll",
ylab = "Win Percentage",
main = "MLB Teams's Overall Win Percentage vs. Payroll")
abline(myfit0, col="red", lwd=4) # many other ways.
abline(h=mean(datapay$win), lwd=5, col="blue") # add a horizontal line, y=mean(y)ggplot
# find the row index of Oakland Athletics and Boston Red Sox
oakland_index <- which(datapay$team=="Oakland Athletics")
redsox_index <- which(datapay$team=="Boston Red Sox")
ggplot(datapay, aes(x = payroll, y = win)) +
geom_point() +
geom_smooth(method="lm", se = T, color = "red") +
geom_hline(aes(yintercept = mean(win)), color = "blue") +
annotate(geom="text",
x=datapay$payroll[oakland_index],
y=datapay$win[oakland_index],
label="Oakland A's",color="blue", vjust = -1) +
annotate(geom="text",
x=datapay$payroll[redsox_index],
y=datapay$win[redsox_index],
label="Red Sox",color="red", vjust = -1) +
labs(title = "MLB Teams's Overall Win vs. Payroll",
x = "Payroll", y = "Win")## `geom_smooth()` using formula 'y ~ x'
Here is how the article concludes that Beane is worth as much as the GM of the Red Sox: looking at the plot above, the Oakland A's win percentage is more or less the same as that of the Red Sox, so based on the LS equation, the team "should" have been paid about 2 billion. Do you agree with this statement?
How well does the linear model fit the data? A common, popular notion is through \(R^2\).
Residual Sum of Squares (RSS):
The least squares approach chooses \(\hat\beta_0\) and \(\hat\beta_1\) to minimize the RSS. RSS is defined as:
\[RSS = \sum_{i=1}^{n} \hat{\epsilon}_i^{2} = \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^{2}\]
## [1] 0.0184
Mean Squared Error (MSE):
Mean Squared Error (MSE) is the average of the squares of the errors, i.e. the average squared difference between the estimated values and the actual values. For simple linear regression, MSE is defined as:
\[MSE = \frac{RSS}{n-2}.\]
Residual Standard Error (RSE) / Root Mean Square Error (RMSE):
Residual Standard Error (RSE) is the square root of MSE. For simple linear regression, RSE is defined as:
\[RSE = \sqrt{MSE} = \sqrt{\frac{RSS}{n-2}}.\]
## [1] 0.0256
## [1] 0.0256
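To connect the definitions above with R, here is a small sketch (on simulated data with illustrative names, not the MLB data) computing RSS, MSE, and RSE by hand and checking RSE against the sigma reported by summary():

```r
# Compute RSS, MSE, and RSE by hand and compare with summary()'s sigma
set.seed(1)
x <- runif(30)                             # hypothetical predictor
y <- 0.4 + 0.06 * x + rnorm(30, sd = 0.03) # hypothetical response
fit <- lm(y ~ x)
n <- length(y)
RSS <- sum(residuals(fit)^2)   # sum of squared residuals
MSE <- RSS / (n - 2)           # divide by n-2 for simple linear regression
RSE <- sqrt(MSE)
RSE                            # agrees with summary(fit)$sigma
```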
Total Sum of Squares (TSS):
TSS measures the total variance in the response \(Y\), and can be thought of as the amount of variability inherent in the response before the regression is performed. In contrast, RSS measures the amount of variability that is left unexplained after performing the regression.
\[TSS = \sum_{i=1}^{n} (y_i - \bar{y})^{2}\]
## [1] 0.0411
\(R^2\):
\(R^{2}\) measures the proportion of variability in \(Y\) that can be explained using \(X\). An \(R^{2}\) statistic that is close to 1 indicates that a large proportion of the variability in the response has been explained by the regression.
\[R^{2} = \frac{TSS - RSS}{TSS}\]
(TSS-RSS)/TSS # Percentage reduction of the total errors
(cor(datapay$win, myfit0$fit))^2 # Square of the cor between response and fitted values
summary(myfit0)$r.squared
## [1] 0.553
## [1] 0.553
## [1] 0.553
Remarks:
How large does \(R^2\) need to be so that you are comfortable using the linear model? Though \(R^2\) is a very popular notion of goodness of fit, it has its limitations. All the sums of squared errors defined so far are Training Errors: they only measure how well a model fits the data used to build it. The model may not generalize well to unseen data.
One of the most important aspects of statistics is realizing that the estimators we propose, such as the least squares estimators of the slope and the intercept, change as a function of the data. Understanding the variability of these statistics and quantifying the accuracy of the estimators is one of the focuses of statisticians.
Recall that we assume a linear model, \[y_i = \beta_0 + \beta_1 x_i + \epsilon_i.\] We did not impose assumptions on \(\epsilon_i\) when using OLS. In order to obtain desirable statistical properties and guarantees for our OLS estimate \((\hat\beta_0, \hat\beta_1)\), we impose the following assumptions.
Linearity:
\[\textbf{E}(y_i | x_i) = \beta_0 + \beta_1 x_i\]
Homoscedasticity:
\[\textbf{Var}(y_i | x_i) = \sigma^2\]
Normality:
\[\epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)\] or equivalently
\[y_i \overset{iid}{\sim} \mathcal{N}( \beta_0 + \beta_1 x_i, \sigma^2)\]
Under the model assumptions:
The OLS estimates \(\hat \beta = (\hat\beta_0, \hat\beta_1)\) has the following properties:
\[\textbf{E}(\hat{\beta}) = \beta\]
\[\hat{\beta}_1 \sim \mathcal{N}(\beta_1, \textbf{Var}(\hat{\beta}_1))\] where
\[\textbf{Var}(\hat{\beta}_1) = \frac{\sigma^{2}} {\sum_{i=1}^{n} (x_i - \bar{x})^{2}}.\]
In general, \[\hat\beta \sim \mathcal{N} (\beta,\ \sigma^2 (X^TX)^{-1}).\] Here \(X\) is the design matrix whose first column is all 1's and whose second column contains the values of \(x\), in our case the payrolls.
Confidence intervals for the coefficients
\(t\)-interval and \(t\)-test can be constructed using the above results.
For example, the 95% confidence interval for \(\beta\) approximately takes the form
\[\hat\beta \pm 2 \cdot SE(\hat\beta).\]
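The 2 above is an approximation to the exact \(t\) multiplier. Here is a quick sketch (simulated data; names and numbers are ours for illustration) that reproduces confint() by hand:

```r
# Hand-rolled 95% t-interval for the slope, compared with confint()
set.seed(7)
x <- runif(30)                             # hypothetical predictor
y <- 0.4 + 0.06 * x + rnorm(30, sd = 0.03) # hypothetical response
fit <- lm(y ~ x)
est <- summary(fit)$coefficients["x", "Estimate"]
se  <- summary(fit)$coefficients["x", "Std. Error"]
tcrit <- qt(0.975, df = fit$df.residual)   # exact multiplier (roughly 2)
c(est - tcrit * se, est + tcrit * se)      # agrees with confint(fit)["x", ]
```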
Tests to see if the slope is 0:
We can also perform hypothesis test on the coefficients. To be specific, we have the following test.
\[H_0: \beta_{1}=0 \mbox{ v.s. } H_1: \beta_{1} \not= 0\]
To test the null hypothesis, we need to decide whether \(\hat \beta_1\) is far away from 0, which depends on \(SE(\hat \beta_1)\). We now define the test statistics as follows.
\[t = \frac{\hat\beta_1 - 0}{SE(\hat \beta_1)}\]
Under the null hypothesis \(\beta_{1}=0\), \(t\) has a \(t\)-distribution with \((n-2)\) degrees of freedom. We can then compute the probability that \(|T|\), with \(T\sim t_{n-2}\), is equal to or larger than \(|t|\), which is termed the \(p\)-value. Roughly speaking, a small \(p\)-value means the chance that \(\beta_{1}=0\) is small, and we can then reject the null hypothesis.
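The \(t\)-statistic and \(p\)-value from the summary output can be reproduced by hand. A sketch on simulated data (names and numbers are ours for illustration):

```r
# Reproduce the slope t-statistic and two-sided p-value by hand
set.seed(11)
x <- runif(30)                             # hypothetical predictor
y <- 0.4 + 0.06 * x + rnorm(30, sd = 0.03) # hypothetical response
fit <- lm(y ~ x)
co <- summary(fit)$coefficients
tstat <- co["x", "Estimate"] / co["x", "Std. Error"]
pval  <- 2 * pt(abs(tstat), df = fit$df.residual, lower.tail = FALSE)
c(tstat, pval)   # agree with the "t value" and "Pr(>|t|)" columns
```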
\(SE(\hat\beta)\), the \(t\)-value, and the \(p\)-value are included in the summary output.
##
## Call:
## lm(formula = win ~ payroll, data = datapay)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.03932 -0.01356 -0.00134 0.00748 0.06660
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4175 0.0148 28.23 < 2e-16 ***
## payroll 0.0621 0.0106 5.88 2.5e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0256 on 28 degrees of freedom
## Multiple R-squared: 0.553, Adjusted R-squared: 0.537
## F-statistic: 34.6 on 1 and 28 DF, p-value: 2.52e-06
The confint() function returns the confidence interval for us (a 95% confidence interval by default).
## 2.5 % 97.5 %
## (Intercept) 0.3872 0.4478
## payroll 0.0405 0.0838
## 0.5 % 99.5 %
## (Intercept) 0.3766 0.4583
## payroll 0.0329 0.0913
We use a confidence interval to quantify the uncertainty surrounding the mean of the response (win). For example, for teams like Oakland A’s whose payroll=.841, a 95% Confidence Interval for the mean of response win is
\[\hat{y}_{|x=.841} \pm t_{(\alpha/2, n-2)} \times \sqrt{MSE \times \left(\frac{1}{n} + \frac{(.841-\bar{x})^2}{\sum(x_i-\bar{x})^2}\right)}.\]
The predict() function provides predictions with a confidence interval using the argument interval="confidence".
new <- data.frame(payroll=c(.841)) #new <- data.frame(payroll=c(1.24))
CImean <- predict(myfit0, new, interval="confidence", se.fit=TRUE)
CImean
## $fit
## fit lwr upr
## 1 0.47 0.455 0.484
##
## $se.fit
## [1] 0.00695
##
## $df
## [1] 28
##
## $residual.scale
## [1] 0.0256
Because \[\textbf{Var}(\hat{y}_{|x}) = \sigma^{2} \Bigg(\frac{1}{n}+\frac{(x-\bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}\Bigg).\]
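We can check this variance formula against the standard error reported by predict(..., se.fit = TRUE). A sketch on simulated data (names, x0, and numbers are ours for illustration):

```r
# Verify the SE of the fitted mean at a point x0 against predict(..., se.fit = TRUE)
set.seed(3)
x <- runif(30)                             # hypothetical predictor
y <- 0.4 + 0.06 * x + rnorm(30, sd = 0.03) # hypothetical response
fit <- lm(y ~ x)
x0 <- 0.5                                  # new point of interest
s  <- summary(fit)$sigma                   # RSE, our estimate of sigma
se_mean <- s * sqrt(1/length(x) + (x0 - mean(x))^2 / sum((x - mean(x))^2))
pr <- predict(fit, data.frame(x = x0), se.fit = TRUE)
c(se_mean, pr$se.fit)                      # the two agree
```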
We can show the confidence interval for the mean response using ggplot() with geom_smooth() using the argument se=TRUE.
# find the row index of Oakland Athletics and Boston Red Sox
oakland_index <- which(datapay$team=="Oakland Athletics")
redsox_index <- which(datapay$team=="Boston Red Sox")
ggplot(datapay, aes(x = payroll, y = win)) +
geom_point() +
geom_smooth(method="lm", se = TRUE, level = 0.95, color = "red") +
geom_hline(aes(yintercept = mean(win)), color = "blue") +
annotate(geom="text",
x=datapay$payroll[oakland_index],
y=datapay$win[oakland_index],
label="Oakland A's",color="green", vjust = -1) +
annotate(geom="text",
x=datapay$payroll[redsox_index],
y=datapay$win[redsox_index],
label="Red Sox",color="red", vjust = -1) +
labs(title = "MLB Teams' Overall Win vs. Payroll",
x = "Payroll", y = "Win")
## `geom_smooth()` using formula 'y ~ x'
A prediction interval can be used to quantify the uncertainty surrounding win for a particular team.
\[\hat{y}_{|x} \pm t_{(\alpha/2, n-2)} \times \sqrt{MSE \times \left( 1+\frac{1}{n} + \frac{(x-\bar{x})^2}{\sum(x_i-\bar{x})^2}\right)}\]
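The prediction interval formula can likewise be verified against predict(..., interval="prediction"). A sketch on simulated data (names, x0, and numbers are ours for illustration):

```r
# Hand-computed 95% prediction interval at a new point x0, checked against predict()
set.seed(5)
x <- runif(30)                             # hypothetical predictor
y <- 0.4 + 0.06 * x + rnorm(30, sd = 0.03) # hypothetical response
fit <- lm(y ~ x)
n <- length(x); x0 <- 0.5
s <- summary(fit)$sigma                    # RSE = sqrt(MSE)
yhat <- unname(predict(fit, data.frame(x = x0)))  # point prediction
half <- qt(0.975, n - 2) *
  s * sqrt(1 + 1/n + (x0 - mean(x))^2 / sum((x - mean(x))^2))
manual_PI <- c(yhat - half, yhat + half)
pi95 <- predict(fit, data.frame(x = x0), interval = "prediction")
manual_PI                                  # agrees with pi95[, c("lwr", "upr")]
```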
We now produce 95% & 99% PI for a future \(y\) given \(x=.841\) using predict() again but with the argument interval="prediction".
new <- data.frame(payroll=c(.841))
CIpred <- predict(myfit0, new, interval="prediction", se.fit=TRUE)
CIpred
CIpred_99 <- predict(myfit0, new, interval="prediction", se.fit=TRUE, level=.99)
CIpred_99
## $fit
## fit lwr upr
## 1 0.47 0.415 0.524
##
## $se.fit
## [1] 0.00695
##
## $df
## [1] 28
##
## $residual.scale
## [1] 0.0256
##
## $fit
## fit lwr upr
## 1 0.47 0.396 0.543
##
## $se.fit
## [1] 0.00695
##
## $df
## [1] 28
##
## $residual.scale
## [1] 0.0256
Now we plot the confidence interval (shaded) along with the 95% prediction interval in blue and 99% prediction interval in green.
# find the row index of Oakland Athletics and Boston Red Sox
oakland_index <- which(datapay$team=="Oakland Athletics")
redsox_index <- which(datapay$team=="Boston Red Sox")
pred_int <- predict(myfit0, interval="prediction")
## Warning in predict.lm(myfit0, interval = "prediction"): predictions on current data refer to _future_ responses
colnames(pred_int) <- c("fit_95", "lwr_95", "upr_95")
pred_int_99 <- predict(myfit0, interval="prediction", level = .99)
## Warning in predict.lm(myfit0, interval = "prediction", level = 0.99): predictions on current data refer to _future_ responses
colnames(pred_int_99) <- c("fit_99", "lwr_99", "upr_99")
cbind(datapay, pred_int, pred_int_99) %>%
ggplot(aes(x = payroll, y = win)) +
geom_point() +
geom_smooth(method="lm", se = TRUE, level = 0.95, color = "red") +
geom_line(aes(y=lwr_95), color = "blue", linetype = "dashed") +
geom_line(aes(y=upr_95), color = "blue", linetype = "dashed") +
geom_line(aes(y=lwr_99), color = "green", linetype = "dashed") +
geom_line(aes(y=upr_99), color = "green", linetype = "dashed") +
annotate(geom="text",
x=datapay$payroll[oakland_index],
y=datapay$win[oakland_index],
label="Oakland A's",color="blue", vjust = -1) +
annotate(geom="text",
x=datapay$payroll[redsox_index],
y=datapay$win[redsox_index],
label="Red Sox",color="red", vjust = -1) +
labs(title = "MLB Teams' Overall Win vs. Payroll",
x = "Payroll", y = "Win")
## `geom_smooth()` using formula 'y ~ x'
From the output above, the 95% prediction interval runs from \(.415\) to \(.524\) for a team like the Oakland A's, but their win is \(.54\). So it is somewhat unusual, though not extremely so. The 99% prediction interval \((.396, .543)\) does contain the Oakland winning percentage of 0.539.
Read page 82 of ISLR to fully understand the difference between confidence and prediction intervals.
How reliable are our confidence intervals and tests? We will need to check the model assumptions in the following steps:
Check linearity first; if linearity is satisfied, then
Check homoscedasticity; if homoscedasticity is satisfied, then
Check normality.
We plot the residuals against the fitted values to
check linearity by checking whether the residuals follow a symmetric pattern around the horizontal line \(y=0\).
check homoscedasticity by checking whether the residuals are evenly distributed within a band.
plot(myfit0$fitted, myfit0$residuals,
pch = 16,
main = "residual plot")
abline(h=0, lwd=4, col="red")
We look at the qqplot of the residuals to check normality.
R and ggplot2 come in handy. The lm() function is one of the most important tools for statisticians. We have put a few extra topics in the Appendices; some of these sections might be covered in class.
Now we understand more about the regression method. If one wants to predict payroll using win as the predictor, can we solve for payroll from the LS equation above?
The answer is NO. Why not?
We first plot our original model:
\[y_{win} = 0.4175 + 0.0621 \cdot x_{payroll}\]
par(mgp=c(1.8,.5,0), mar=c(3,3,2,1))
plot(datapay$payroll, datapay$win,
pch = 16,
xlab = "Payroll",
ylab = "Win Percentage",
main = "MLB Teams's Overall Win Percentage vs. Payroll")
abline(lm(win ~ payroll, data=datapay), col="red", lwd=4) Now, we reverse the regression and look at the summary output
##
## Call:
## lm(formula = payroll ~ win, data = datapay)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7891 -0.1967 -0.0326 0.1832 0.6949
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.119 0.758 -4.11 0.00031 ***
## win 8.893 1.512 5.88 2.5e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.306 on 28 degrees of freedom
## Multiple R-squared: 0.553, Adjusted R-squared: 0.537
## F-statistic: 34.6 on 1 and 28 DF, p-value: 2.52e-06
We can now overlay our two regressions:
\[y_{payroll} = -3.119 + 8.893 \cdot x_{win}\]
\[y_{win} = 0.4175 + 0.0621 \cdot x_{payroll}\] Notice that the first fitted line is not the same as solving for win from the reversed regression, which gives \[ x_{win} = \frac{y_{payroll} + 3.119}{8.893} = 0.351 + 0.112 \cdot y_{payroll}. \]
beta0 <- myfit1$coefficients[1]
beta1 <- myfit1$coefficients[2]
win2 <- (datapay$payroll - beta0) / beta1
plot(datapay$payroll, datapay$win,
pch = 16,
xlab = "Payroll",
ylab = "Win Percentage",
main = "MLB Teams's Overall Win Percentage vs. Payroll")
text(datapay$payroll, datapay$win, labels=datapay$team, cex=0.7, pos=2) # label teams
abline(lm(win ~ payroll, data=datapay), col="red", lwd=4)
lines(datapay$payroll, win2, col="green", lwd=4)
legend("bottomright", legend=c("y=winpercent", "y=payroll"),
lty=c(1,1), lwd=c(2,2), col=c("red","green")) Conclusion: Two lines are not the same!!!!!
We may also want to get the LS equation w/o Oakland first.
##
## Call:
## lm(formula = win ~ payroll, data = subdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.03685 -0.01097 0.00147 0.01169 0.05281
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.40800 0.01333 30.61 < 2e-16 ***
## payroll 0.06749 0.00943 7.16 1.1e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0225 on 27 degrees of freedom
## Multiple R-squared: 0.655, Adjusted R-squared: 0.642
## F-statistic: 51.3 on 1 and 27 DF, p-value: 1.07e-07
The plot below shows the effect of Oakland on our linear model.
plot(subdata$payroll, subdata$win,
pch = 16,
xlab = "Payroll", ylab="Win Percentage",
main = "The effect of Oakland")
lines(subdata$payroll, predict(myfit2), col="blue", lwd=3)
abline(myfit0, col="red", lwd=3)
legend("bottomright", legend=c("Reg. with Oakland", "Reg. w/o Oakland"),
lty=c(1,1), lwd=c(2,2), col=c("red","blue"))We may also want to check the effect of Yankees (2.70, .58)
plot(subdata$payroll, subdata$win,
pch = 16,
xlab = "Payroll",
ylab = "Win Percentage",
main = "The effect of Yankees")
abline(myfit3, col="blue", lwd=3)
abline(myfit0, col="red", lwd=3)
legend("bottomright", legend=c("Reg. All", "Reg. w/o Yankees"),
lty=c(1,1), lwd=c(2,2), col=c("red","blue"))Difference between \(z\) and \(t\) with \(df=n\). The distribution of \(z\) is similar to that \(t\) when \(df\) is large, say 30.
Check Normality
See what a \(t\) variable looks like with 5 degrees of freedom.
df <- 5 # you may change df; when df is large, t is approximately the same as z
t <- rt(1000, df) #
hist(t, freq=FALSE, col="blue", breaks=50, xlim=c(-5,5),
main=paste("Hist of t with df=",df))Make a small set of graphs to see the variability of sample from sample
# Put graphs into 2 rows and 2 cols
par(mfrow=c(2,2))
for (i in 1:4){
z <- rnorm(1000)
par(mgp=c(1.8,.5,0), mar=c(3,3,2,1))
hist(z, freq=FALSE, col="red", breaks=30, xlim=c(-5,5))
qqnorm(z, pch=16, col="blue", cex=0.7) # check normality
qqline(z)
}
A perfect model between X and Y, but it is not linear: here \(y=x^{3}\) with no noise, yet \(R^{2}=.837\)!
x <- seq(0, 3, by=.05) # or x=seq(0, 3, length.out=61)
y <- x^3 # no variability
myfit <- lm(y ~ x)
myfit.out <- summary(myfit)
rsquared <- myfit.out$r.squared
plot(x, y, pch=16, ylab="",
xlab = "No noise in the data",
main = paste("R squared = ", round(rsquared, 3), sep=""))
abline(myfit, col="red", lwd=4)A perfect linear model between X and Y but with noise here \(y= 2+3x + \epsilon\), \(\epsilon \overset{iid}{\sim} N(0,9)\). Run this repeatedly
par(mfrow = c(2,2))
for(i in 1:4){
x <- seq(0, 3, by=.02)
e <- 3*rnorm(length(x)) # Normal random errors with mean 0 and sigma=3
y <- 2 + 3*x + e
myfit <- lm(y ~ x)
myfit.out <- summary(myfit)
rsquared <- round(myfit.out$r.squared,3)
hat_beta_0 <- round(myfit$coe[1], 2)
hat_beta_1 <- round(myfit$coe[2], 2)
par(mgp=c(1.8,.5,0), mar=c(3,3,2,1))
plot(x, y, pch=16, ylab="",
xlab = "True lindear model with errors",
main = paste("R squared= ", rsquared,
"LS est's=", hat_beta_0, "and", hat_beta_1))
abline(myfit, col="red", lwd=4)
}
Similar setup but with smaller variance: \(\epsilon \overset{iid}{\sim} N(0,1)\).
par(mfrow = c(2,2))
for(i in 1:4){
x <- seq(0, 3, by=.02)
e <- rnorm(length(x)) # Normal random errors with mean 0 and sigma=1
y <- 2 + 3*x + e
myfit <- lm(y ~ x)
myfit.out <- summary(myfit)
rsquared <- round(myfit.out$r.squared, 3)
b1 <- round(myfit.out$coe[2], 3)
par(mgp=c(1.8,.5,0), mar=c(3,3,2,1))
plot(x, y, pch=16,
ylab = "",
xlab = paste("LS estimates, b1=", b1, ", R^2=", rsquared),
main = "The true model is y=2+3x+n(0,1)")
abline(myfit, col="red", lwd=4)
What do we expect to see even when all the model assumptions are met?
We demonstrate this through a simulation.
Here is a case in which all the linear model assumptions are met. Once again, everything can be checked by examining the residual plots.
## Set up the simulations
set.seed(1) # set seed so that x will be the same each time we run it
x <- runif(100) # generate 100 random numbers from [0, 1], hist(x)
hist(x)
\[y = 1 +2x + \mathcal{N}(0,2^2) \quad i.e. \quad \beta_0 = 1 ,\beta_1 = 2, \sigma^2 = 4.\]
# mar: c(bottom, left, top, right) to specify the margin on four sides
# mgp: specify the margin for axis title, axis labels and axis line
# par(mfrow=c(2,2), mar=c(2,2,2,2), mgp=c(2,0.5,0))
# for (i in 1:4) {
y <- 1 + 2*x + rnorm(100, 0, 2) # generate response y's. may change n
fit <- lm(y ~ x)
fit.perfect <- summary(lm(y ~ x))
rsquared <- round(fit.perfect$r.squared, 2)
hat_beta_0 <- round(fit.perfect$coefficients[1], 2)
hat_beta_1 <- round(fit.perfect$coefficients[2], 2)
hat_sigma <- round(fit.perfect$sigma, 2)
plot(x, y, pch=16,
ylim=c(-8,8),
xlab="a perfect linear model: true mean: y=1+2x in blue, LS in red",
main=paste("R squared= ",rsquared, ",",
"hat sig=", hat_sigma, ",", "LS b1=",hat_beta_1, ",", "and b0=", hat_beta_0))
abline(fit, lwd=4, col="red")
lines(x, 1+2*x, lwd=4, col="blue") #
#
# plot(x, y, pch=i,
# ylim = c(-8,8),
# xlab = "true mean: y=1+2x in blue, LS in red",
# main = paste("R^2=", rsquared,
# ", b1_LSE=", hat_beta_1, " and b0=", hat_beta_0))
# abline(fit, lwd=4, col="red")
# lines(x, 1 + 2*x, lwd=4, col="blue")
# abline(h= mean(y), lwd=4, col="green")
# }
The theory says that
\[\hat\beta_1 \sim \mathcal{N} (\beta_1 = 2,\, \textbf{Var}(\hat\beta_1)).\]
sigma <- 2
n <- length(y)
sd_b1 <- sqrt(sigma^2 /((n-1)* (sd(x))^2)) # we will estimate sigma by rse in real life.
sd_b1
summary(fit)
Plots
plot(x, y, pch=16,
ylim = c(-8,8),
xlab = "a perfect linear model: true mean: y=1+2x in blue, LS in red",
main = paste("R squared=", rsquared,
", LS estimates b1=", hat_beta_1, " and b0=", hat_beta_0))
abline(fit, lwd=4, col="red")
lines(x, 1 + 2*x, lwd=4, col="blue")
abline(h= mean(y), lwd=4, col="green") # Residuals
plot(fit$fitted, fit$residuals,
pch = 16,
ylim = c(-8, 8),
main = "residual plot")
abline(h=0, lwd=4, col="red")We remind the readers definition of sample statistics here.
\[\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i\]
\[s^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}\]
\[s = \sqrt\frac{\sum_{i=1}^{n}(y_i - \bar{y})^2} {n - 1}\]
\[r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y} \]
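These definitions match R's built-in functions. A quick check of the correlation formula against cor() (simulated data; names are ours for illustration):

```r
# Check the sample correlation formula against cor()
set.seed(9)
x <- rnorm(20)
y <- rnorm(20)
n <- length(x)
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
# equivalent form using sample standard deviations (note the n-1 factor)
r_sd <- sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * sd(x) * sd(y))
c(r_manual, r_sd)   # both agree with cor(x, y)
```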